Exploring Data Exploration

Data Exploration

When it comes to Data Science education, algorithms and techniques get the lion's share of attention. Everyone wants to learn the hottest technique. It was true for Neural Networks in the 80s and Support Vector Machines in the 90s. It feels like it is now true for Deep Learning.

Out in the real world of Data Science, it's not just algorithms that produce results; there is a lot more going on. I learned the hard way (empirically) that only 10% to 20% of a project's time is spent applying an algorithm to data. Often an algorithm can be applied in a few lines of code once the right data is in the right format. It is the path to getting the right data into the right format that usually takes most of the time, sometimes as much as 60% of the project. That 60% includes earthly and mundane things like Data Cleaning, Data Re-shaping, Data Exploration and Feature Engineering.

Real-world data is hardly ever clean, so some level of Data Cleaning is unavoidable. The same holds true for Data Re-shaping, because raw data is hardly ever in a shape and format suitable for good analysis. Feature Engineering is also almost always unavoidable, because most algorithms can't deal with the sheer volume and complexity of raw data. (Can Deep Learning really be an exception to this?) So what remains is Data Exploration. Since "Exploration" feels discretionary, that's the place to cut some slack, right? In this series we will do some "exploring" and find out the answer. This is part 1 of the series.

There is one cool thing I must share related to this topic. John Tukey became famous for co-inventing the FFT algorithm; the high-speed LTE connection on your cellphone owes something to it. He also came up with box plots and made many other contributions to statistics. But one not-so-well-known aspect is that he was one of the pioneers of exploratory data analysis. Here is a video that clearly shows he was ahead of his time.

John Tukey's Video

A quick note about tools and plots: we will use Pandas and Numpy for the bulk of the analysis and Plotly for graphing. Plotly's Python API was recently open sourced and is an excellent way to publish graphs online. You can click on any plot and "interact" with it by zooming, panning, etc. You can reset a plot by double-clicking on it.

1. The Data Set

The raw data set we will be working with is taken from the UCI Machine Learning Repository. Here is the direct link to it: UCI Smartphone Sensor data set. A big thanks to Jorge L. Reyes-Ortiz et al. for sharing the data set.

It contains Accelerometer and Gyroscope signal data collected from cell phones. The signals were collected while users performed various activities like walking, walking up the stairs, changing posture from sitting to standing, etc. All the signals in the data set are labeled with the activity being performed. The objective of the project is to identify the human activity from a new, unlabeled observation.

This Human Activity Recognition problem has attracted great interest recently and has many applications. For example, it could be very useful to Fitbit for classifying their users' actions correctly. Other applications include remote elderly care, gesture recognition in video games, etc.

1.1. The Cleaned and Transformed Data Set

In order to focus on the Data Exploration part, we will work with a Cleaned and Transformed version of the Data set. Similar to what is described on the data set's webpage, the following actions were performed on the raw data:

  • Removed noise from the signals
  • Broke the signals down into shorter windows with overlapping edges
  • Separated the Acceleration signal into Body and Gravity components
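The windowing step can be sketched with plain NumPy. The data set's description mentions 128-sample windows with 50% overlap (64 samples, i.e. 1.28 s at 50 Hz); `sliding_windows` below is a hypothetical helper illustrating the idea, not the authors' actual preprocessing code:

```python
import numpy as np

def sliding_windows(signal, window_sz=128, overlap=64):
    """Split a 1-D signal into fixed-size windows with overlapping edges."""
    step = window_sz - overlap
    starts = range(0, len(signal) - window_sz + 1, step)
    return np.array([signal[s:s + window_sz] for s in starts])

# A 320-sample signal yields 4 half-overlapping 128-sample windows
windows = sliding_windows(np.arange(320))
print(windows.shape)  # (4, 128)
```

Each window then becomes one "block" of the transformed data below.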

Finally, all of this is put together into a format that is more suitable for analysis. Let's take a look at what this transformed data looks like.

First, some top-level setup. The code comments explain what's going on.

In [1]:
# Bulk of the analysis is in Pandas and Numpy
import pandas as pd
import numpy as np

#Plotly is the Graphing library
import plotly.plotly as py
import plotly.offline as pyoffline
import plotly.tools as tls
import plotly.graph_objs as go

import sys

# Set Plotly to offline mode to render graphs within this notebook
pyoffline.init_notebook_mode()

# Signals are broken down into 128 sample blocks. i.e 2.56 seconds
# 128 samples per block
block_sz = 128
sampling_rate = 50

# index in seconds
index_in_sec = np.linspace(0,(block_sz-1)/sampling_rate,block_sz)

Loading the transformed data

In [2]:
# Read the Activity labels file
# Use delim_whitespace; it's faster than a sep='\s+' regex (matters for larger files)
labels = pd.read_csv('../Data/activity_labels.txt',names=['activity','label_name'],delim_whitespace=True)
labels = labels.set_index('activity')

# Read in the sample file. This is just a subset of the whole data. Keeping it short for faster exploration
samples = pd.read_csv('../Data/FilteredData/subset_sample_data_file.csv')

# Bookkeeping. Calculate how many blocks we have per activity
num_blocks_per_activity = np.array([len(samples[samples.label == i].index) for i in range(1,13)],dtype=int) // block_sz

A look at the transformed data

In [3]:
samples.tail(5)
Out[3]:
ac_acc_x ac_acc_y ac_acc_z dc_acc_x dc_acc_y dc_acc_z mag_acc gyro_x gyro_y gyro_z mag_gyro label block_idx
513787 0.007929 0.019201 -0.007170 0.236497 0.391867 0.887224 1.001607 -0.008288 0.021437 0.006498 0.023884 6 4013
513788 0.008215 0.017600 -0.009463 0.236588 0.391887 0.887191 0.999007 -0.003132 0.001691 0.008173 0.008914 6 4013
513789 0.015644 0.011469 -0.022498 0.236685 0.391916 0.887152 0.986922 0.002668 0.024590 -0.003418 0.024969 6 4013
513790 0.018274 0.010507 -0.027842 0.236790 0.391954 0.887107 0.982531 0.010498 0.056734 0.003243 0.057788 6 4013
513791 0.008738 0.011682 -0.009520 0.236901 0.392001 0.887054 0.996677 -0.009644 0.053015 0.014595 0.055827 6 4013
1.1.1 Some Notes about the Data Set

The signals are in the columns of the samples data frame. Any column with "acc" in its name is an acceleration signal. The "ac" prefix marks the body acceleration component and the "dc" prefix the gravity component; the names follow the AC/DC convention because the gravity component is (nearly) constant while the body component changes with time. The gravity component is extracted by low-pass filtering the composite acceleration signal. "mag_acc" and "mag_gyro" are the vector magnitudes of the composite acceleration and gyroscope signals.
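The low-pass filtering step can be sketched with scipy. The data set's webpage mentions a Butterworth low-pass filter with a 0.3 Hz cutoff for the gravity component; the helper below is a minimal sketch of that idea, not the exact pipeline used to produce these columns:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def split_gravity_body(acc, fs=50, cutoff=0.3):
    """Split a raw acceleration signal into gravity (dc) and body (ac)
    components via a 3rd-order Butterworth low-pass filter."""
    b, a = butter(3, cutoff / (fs / 2), btype='low')
    gravity = filtfilt(b, a, acc)  # zero-phase filtering
    body = acc - gravity
    return gravity, body

# Synthetic check: a constant "gravity" level plus a fast body oscillation
t = np.linspace(0, 10, 500)
raw = 0.9 + 0.1 * np.sin(2 * np.pi * 5 * t)
gravity, body = split_gravity_body(raw)
```

The recovered gravity component stays near the constant 0.9 level, while the 5 Hz oscillation ends up in the body component.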

Some Bookkeeping Information

In [4]:
# Units and title information for all the signals
signal_names = samples.columns.values

# First seven values are acceleration, the next 4 are gyro and the last 2 are bookkeeping
signal_units_list = [' (units = g)']*7 + [' (units = rad/sec)' ]*4 + [' (NA)']*2

# Verbose description of signals
signal_titles_list = ['Body Acceleration along X axis', 'Body Acceleration along Y axis', 'Body Acceleration along Z axis',
                        'Gravity Acceleration along X axis','Gravity Acceleration along Y axis','Gravity Acceleration along Z axis',
                        'Vector Magnitude of Acceleration','Gyroscope signal along X axis','Gyroscope signal along Y axis',
                        'Gyroscope signal along Z axis','Vector Magnitude of Gyroscope','What the ?','What the ?'] 

# Build Data frame 
signal_dict = dict(units=signal_units_list,ytitle=signal_titles_list)

signal_df = pd.DataFrame(signal_dict,index=signal_names)

Units and Labels

In [5]:
signal_df.head(5)
Out[5]:
units ytitle
ac_acc_x (units = g) Body Acceleration along X axis
ac_acc_y (units = g) Body Acceleration along Y axis
ac_acc_z (units = g) Body Acceleration along Z axis
dc_acc_x (units = g) Gravity Acceleration along X axis
dc_acc_y (units = g) Gravity Acceleration along Y axis
In [6]:
labels.head(5)
Out[6]:
label_name
activity
1 WALKING
2 WALKING_UPSTAIRS
3 WALKING_DOWNSTAIRS
4 SITTING
5 STANDING

2. Utility Functions

These are just some utility functions used in this post. Code comments explain the purpose of each function.

In [7]:
# Function to plot various signals. 
# Example calls: 
# 1) To plot 1 trace of activity 1 with Acceleration in the x-direction
# plot_signals(activities = [1],num_tr_per_act = 1,signal_name = 'ac_acc_x')

# 2) To plot 5 traces per each of the activities 1 and 5 with gyroscope signal in the y-direction
# plot_signals(activities = [1,5],num_tr_per_act = 5,signal_name = 'gyro_y')

def plot_signals(activities = [1,3], num_tr_per_act = 3,signal_name = 'ac_acc_x') :

    #num_tr_per_act
    # Number of blocks per activity to be plotted . i.e number of traces to be plotted per activity type

    #activities
    # Activities to be plotted

    # signal_name
    # Signal to be compared between the two activities
    
    # Colors to be used
    color_scheme = ['Orange', 'Olive', 'DarkBlue','DarkOliveGreen','DarkGoldenRod','DarkGray','DarkGreen','DarkKhaki','DarkMagenta',
                     'DarkOrange','DarkOrchid','DarkRed','DarkSalmon','DarkSeaGreen','DarkSlateBlue','DarkSlateGray']
    
    if len(activities) > len(color_scheme) :
        sys.exit("Slow down! I can't handle so many activities")
    
    signal_units = signal_df.loc[signal_name,'units']
    plot_title = signal_df.loc[signal_name,'ytitle']
    
    ##########################################################################
    # For example, if we want to plot activity 1, mag_gyro signal, the following gives the right signal for it :
    # samples[samples.label==1].mag_gyro.iloc[blk_start_sample:blk_end_sample]
    
    # We want to automate this because we don't know the activity number and signal name before hand. 
    # Generate a list of python expressions like the above and use eval() on them 
    ##########################################################################

    eval_expr = ['samples[samples.label=={}].{}.iloc[blk_start_sample:blk_end_sample]'.format(activity, signal_name) for activity in activities]

    # Plotly Figure object consists of data and layout parts
    # create data object and append to it in the loop
    data = go.Data()

    # setup colors for plotly
    markers = [dict(color=mycolor) for mycolor in color_scheme ]
    
    ##########################################################################
    # Generate all the traces to be plotted
    ##########################################################################
    
    # Loop over the number of traces per activity
    for i in range(num_tr_per_act) :
        
        # Sample start and end for this block
        # For ex, 0:128 for first block gets sample[0] to sample[127] 
        blk_start_sample = i * block_sz
        blk_end_sample = blk_start_sample + block_sz
        
        # Increase opacity with each successive trace of an activity,
        # ending at 1 for the last trace
        opacity = 0.2 + (0.8 * (i + 1) / num_tr_per_act)
        
        if i == (num_tr_per_act - 1) :
            legend_on = True
        else :
            legend_on = False

        # Loop over number of signals
        for j in range(len(activities)) : 
            
            # The signal for this activity
            y = eval(eval_expr[j])
            
            # Trace for the plot
            trace = go.Scatter(x=index_in_sec,y=y,name=labels.loc[activities[j],'label_name'],
                               opacity=opacity,marker=markers[j],
                               legendgroup=str(activities[j]),hoverinfo="y",
                               showlegend=legend_on)

            # Append traces to the data object
            data.append(trace)

    # Create layout object
    layout = go.Layout(title=plot_title,xaxis = dict(title = 'Time (secs)'),
                       yaxis = dict(title = signal_name + signal_units),legend=dict(x=.9,y=100,))

    # Plot
    fig=go.Figure(data=data,layout=layout)

    pyoffline.iplot(fig)

3. Basic Exploration

Let's get started with some basic plots to get a feel for the data. The accelerometers in cellphones measure proper acceleration, i.e. acceleration relative to free fall. This means a free-falling body measures zero acceleration, while an object at rest measures earth's g.

Since our data has activities like Standing, Sitting and Laying, their acceleration signals should roughly measure g. Let's make some plots to see if that's really true.
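As a quick sanity check, mag_acc appears to be the Euclidean norm of the composite (body + gravity) acceleration axes. A minimal sketch of that computation, using the gravity components from the last row of the table above:

```python
import numpy as np

def vector_magnitude(x, y, z):
    """Euclidean norm of a 3-axis signal; for a static posture it should sit near 1 g."""
    return np.sqrt(x**2 + y**2 + z**2)

# Gravity components from the last row of the transformed data
print(vector_magnitude(0.236901, 0.392001, 0.887054))  # ~0.998, close to 1 g
```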

In [8]:
# Plot vector magnitude of acceleration for walking and sitting
plot_signals(activities = [1,4],num_tr_per_act = 1,signal_name = 'mag_acc')

As we can see, the acceleration signal while sitting is very close to g, but it varies quite a bit while walking. The same holds for the other static postures vs moving activities.

In [9]:
# Plot vector magnitude of acceleration for walking and standing
plot_signals(activities = [1,5],num_tr_per_act = 1,signal_name = 'mag_acc')
In [10]:
# Plot vector magnitude of acceleration for laying and sit to stand position change
plot_signals(activities = [8,6],num_tr_per_act = 1,signal_name = 'mag_acc')

Looking at the plots above hints at the possibility of using the standard deviation of the mag_acc signal to classify between static and active activities. Let's take a quick look at the summary statistics for walking vs standing: the mean of both signals is ~1, but the standard deviation for walking is about 100 times higher than for standing. Let's also look at the histograms of these signals to see this visually.

So the standard deviation of a signal can classify between the Static and Active activity groups very accurately. How about classifying within a group, say between Walking and Walking Upstairs?

In [11]:
# DF of one block of data for walking
summary_df = samples[samples.label==1].iloc[0:128].describe()
summary_df.loc[['mean','std'],'mag_acc']
Out[11]:
mean    1.050698
std     0.234850
Name: mag_acc, dtype: float64
In [12]:
# DF of one block of data for Standing
summary_df = samples[samples.label==5].iloc[0:128].describe()
summary_df.loc[['mean','std'],'mag_acc']
Out[12]:
mean    1.031661
std     0.002466
Name: mag_acc, dtype: float64
In [13]:
# Embed this directly from my online plot that has been formatted in GUI
tls.embed('rkoppula', '132')
Out[13]:

Try zooming into the above histogram to see just how concentrated the signal for Standing is. This is why I like interactive charts.
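The contrast in standard deviations suggests a simple per-block rule. Here is a minimal sketch on synthetic blocks; `classify_static` and the 0.05 threshold are hypothetical, the threshold being a cut between the two observed scales (~0.002 for standing vs ~0.23 for walking):

```python
import numpy as np
import pandas as pd

def classify_static(df, threshold=0.05):
    """Flag each block as static (True) or active (False) by the
    standard deviation of its mag_acc signal."""
    return df.groupby('block_idx')['mag_acc'].std() < threshold

# Two synthetic 128-sample blocks: a quiet "standing" one and a noisy "walking" one
rng = np.random.RandomState(0)
demo = pd.DataFrame({
    'mag_acc': np.concatenate([1 + 0.002 * rng.randn(128),
                               1 + 0.23 * rng.randn(128)]),
    'block_idx': [0] * 128 + [1] * 128,
})
print(classify_static(demo).values)  # [ True False]
```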

3.1 More Plots

So how do the signals compare between activities like Walking and Walking Upstairs? In both activities the user is moving, so we don't expect obvious differences, but being observant might give some hints.

Let's start by comparing the mag_acc signal between Walking and Walking Upstairs.

In [14]:
# Plot vector magnitude of acceleration for walking and walking upstairs
plot_signals(activities = [1,2],num_tr_per_act = 1,signal_name = 'mag_acc')

There isn't anything very obvious, but if you squint you can see that both the green and orange traces appear to have a periodic component. Moreover, the green seems to change more slowly than the orange. Let's check by overlaying more windows of these signals on the same plot.

In [15]:
# Plot many traces of vector magnitude of acceleration for walking and walking upstairs  
plot_signals(activities = [1,2],num_tr_per_act = 5,signal_name = 'mag_acc')

Now a pattern seems to emerge, for the green at least. It definitely looks periodic and less wiggly than the orange. This indicates that taking the FFT of these signals will reveal different power spectrum profiles for the two. One potential classification feature, then, is the percentage of energy in the low-frequency bins of their respective FFTs.
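That feature can be sketched in a few lines of NumPy. The choice of the 5 lowest non-DC bins (roughly 0.4 to 2 Hz at 50 Hz sampling) is a hypothetical one:

```python
import numpy as np

def low_freq_energy_fraction(block, low_bins=5):
    """Fraction of a block's spectral energy in the lowest non-DC FFT bins."""
    spectrum = np.abs(np.fft.rfft(block - block.mean())) ** 2
    return spectrum[1:1 + low_bins].sum() / spectrum[1:].sum()

# A slow ~1 Hz oscillation vs a fast ~10 Hz one over a 128-sample block
t = np.arange(128) / 50.0
slow = np.sin(2 * np.pi * 1.0 * t)
fast = np.sin(2 * np.pi * 10.0 * t)
print(low_freq_energy_fraction(slow) > low_freq_energy_fraction(fast))  # True
```

A slowly varying signal concentrates its energy in the low bins, so this one number could separate the two gait speeds.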

We have so far looked only at the acceleration signals. Let's take a quick look at the gyro signals, comparing mag_gyro between the SIT_TO_STAND and SIT_TO_LIE activities.

In [16]:
# Plot vector magnitude of gyro signal for SIT_TO_STAND and SIT_TO_LIE
plot_signals(activities = [8,9],num_tr_per_act = 1,signal_name = 'mag_gyro')

Both signals show a concentration of energy in the time domain, but no clear differences show up. Let's plot more traces of these signals on the same plot.

In [17]:
# Plot many traces of vector magnitude of gyro signal for SIT_TO_STAND and SIT_TO_LIE   
plot_signals(activities = [8,9],num_tr_per_act = 5,signal_name = 'mag_gyro')

Still no visible pattern. Let's see if the individual components of the gyro signal show any pattern.

In [18]:
# Plot the x component of the gyro signal (z component in the next cell)
plot_signals(activities = [8,9],num_tr_per_act = 5,signal_name = 'gyro_x')
In [19]:
plot_signals(activities = [8,9],num_tr_per_act = 5,signal_name = 'gyro_z')

The z direction shows some pattern, with the green staying below the x-axis most of the time while the orange doesn't. A simple mean of the signal might classify between these two activities well.
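A sketch of that idea, with synthetic blocks standing in for the two activities; `mean_sign_feature` and the -0.3/0.1 levels are hypothetical:

```python
import numpy as np
import pandas as pd

def mean_sign_feature(df, signal='gyro_z'):
    """Per-block mean of a signal; a sign test on it could separate
    the two transition activities seen above."""
    return df.groupby('block_idx')[signal].mean()

# Two synthetic blocks: one mostly below zero, one mostly above
demo = pd.DataFrame({
    'gyro_z': np.concatenate([np.full(128, -0.3), np.full(128, 0.1)]),
    'block_idx': [0] * 128 + [1] * 128,
})
print(mean_sign_feature(demo).values)  # [-0.3  0.1]
```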

4. Summary

We are starting to get the hang of the data set now, and we are gathering hints about the solution and how to design a good feature set. We know that in some cases simple statistics like the mean and standard deviation of a signal are enough to separate activities; in other cases we may have to look at frequency-domain statistics. We will dig further into Data Visualization in part 2 and see what other insights we can obtain.